Tiered Tagging Revisited

Authors

  • Dan Tufis
  • Liviu Dragomirescu
Abstract

In this paper we describe a new baseline tagset induction algorithm which, unlike the one described in previous work, is fully automatic and produces tagsets with better performance. The algorithm is an information-lossless transformation of the MULTEXT-East compliant lexical tags into a reduced tagset that can be mapped back onto the lexicon tagset fully deterministically. Starting from the baseline tagsets, a corpus linguist who is an expert in the language in question may reduce the tagsets further by taking the distributional properties of the language into account. Because any further reduction of the baseline tagsets loses information, adequate recovery rules must be designed to ensure that the final tagging is expressed in terms of the lexicon encoding.
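The "information lossless" property mentioned in the abstract can be illustrated with a small sketch. This is not the authors' induction algorithm; it only shows one way to check that a candidate reduction is deterministically reversible through the lexicon. The toy lexicon, the positional tag strings, and the helper names (`keep_positions`, `is_lossless`, `recover`) are all assumptions made for illustration.

```python
from typing import Callable, Dict, Set

# Hypothetical lexicon: word form -> set of full lexical tags.
# The tag strings are invented, MSD-like positional codes, not real entries.
Lexicon = Dict[str, Set[str]]

TOY_LEXICON: Lexicon = {
    "wordA": {"Ncfsrn", "Ncfpon"},
    "wordB": {"Afpms-", "Ncms-n"},
}


def keep_positions(positions) -> Callable[[str], str]:
    """Build a reduction that keeps only the given character positions of a tag."""
    return lambda tag: "".join(tag[i] for i in positions if i < len(tag))


def is_lossless(lexicon: Lexicon, reduce: Callable[[str], str]) -> bool:
    """True if, for every word form, distinct full tags get distinct reduced tags."""
    for tags in lexicon.values():
        if len({reduce(t) for t in tags}) != len(tags):
            return False
    return True


def recover(lexicon: Lexicon, word: str, reduced_tag: str,
            reduce: Callable[[str], str]) -> str:
    """Map a reduced tag back to the unique full tag licensed by the lexicon."""
    candidates = [t for t in lexicon[word] if reduce(t) == reduced_tag]
    if len(candidates) != 1:
        raise ValueError(f"no unique recovery for {word!r} / {reduced_tag!r}")
    return candidates[0]


# Keeping only the first two positions merges both tags of "wordA".
reduce_coarse = keep_positions((0, 1))
# Keeping one more position separates them again in this toy lexicon.
reduce_fine = keep_positions((0, 1, 3))

print(is_lossless(TOY_LEXICON, reduce_coarse))
print(is_lossless(TOY_LEXICON, reduce_fine))
print(recover(TOY_LEXICON, "wordA", "Ncp", reduce_fine))
```

In this sketch a reduction counts as lossless when it never merges two tags that the lexicon lists for the same word form, which is what allows the reduced tag to be mapped back without hand-written recovery rules.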

Related resources

Maximum Entropy Tiered Tagging

Data sparseness in tagging highly inflectional languages with large tagsets and scarce training resources is a problem that cannot be addressed using only common tagging techniques. Tiered tagging is a two-stage technique that tags with a smaller "hidden" tagset and, in the second phase, recovers the original tagset using a lexicon and a set of hand-written rules. The recovering is possi...

Tiered Tagging and Combined Language Models Classifiers

We address the problem of morpho-syntactic disambiguation of arbitrary texts in a highly inflectional natural language. We use a large tagset (615 tags), EAGLES and MULTEXT compliant [5]. The large tagset is internally mapped onto a reduced one (82 tags), serving statistical disambiguation, and a text disambiguated in terms of this tagset is subsequently subject to a recovery process of all the i...

A Tiered CRF Tagger for Polish

In this paper we present a new approach to morphosyntactic tagging of Polish by bringing together Conditional Random Fields and tiered tagging. Our proposal also makes it possible to take advantage of a rich set of morphological features, which rely on an external morphological analyser. The proposed algorithm is implemented as a tagger for Polish. Evaluation of the tagger shows significant improvement ...

Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language

Standard methods for part-of-speech tagging suffer from data sparseness when used on highly inflectional languages (which require large lexical tagset inventories). For this reason, a number of alternative methods have been proposed over the years. One of the most successful methods used for this task, called Tiered Tagging (Tufiş, 1999), exploits a reduced set of tags derived by removing severa...

Principled Hidden Tagset Design for Tiered Tagging of Hungarian

For highly inflectional languages, the number of morpho-syntactic descriptions (MSDs) required to cover the content of a word-form lexicon tends to rise quite rapidly, approaching a thousand or even more distinct codes. For the purpose of automatic disambiguation of arbitrary written texts, using such large tagsets would raise many problems, starting from implementa...


Publication date: 2004